# Reading packages
from bs4 import BeautifulSoup
import requests
import math
import pandas as pd
import pickle
import random
import numpy as np
import matplotlib.pyplot as plt
import os
import networkx as nx
import networkx.algorithms.community as nx_comm
import re
import string
import netwulf
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
from collections import Counter
# Showing images in notebook
from IPython.display import Image
# Semantics analysis
%pip install textblob
import nltk
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from collections import defaultdict
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from PIL import Image
Requirement already satisfied: textblob in /Users/fridajorgensen/opt/anaconda3/lib/python3.9/site-packages (0.17.1)
Requirement already satisfied: nltk>=3.1 in /Users/fridajorgensen/opt/anaconda3/lib/python3.9/site-packages (from textblob) (3.7)
Requirement already satisfied: regex>=2021.8.3 in /Users/fridajorgensen/opt/anaconda3/lib/python3.9/site-packages (from nltk>=3.1->textblob) (2022.7.9)
Requirement already satisfied: tqdm in /Users/fridajorgensen/opt/anaconda3/lib/python3.9/site-packages (from nltk>=3.1->textblob) (4.64.1)
Requirement already satisfied: joblib in /Users/fridajorgensen/opt/anaconda3/lib/python3.9/site-packages (from nltk>=3.1->textblob) (1.1.0)
Requirement already satisfied: click in /Users/fridajorgensen/opt/anaconda3/lib/python3.9/site-packages (from nltk>=3.1->textblob) (8.0.4)
Note: you may need to restart the kernel to use updated packages.
[nltk_data] Downloading package stopwords to /Users/fridajorgensen/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
We chose to work with the characters of the Harry Potter universe as our data. We will use the Harry Potter API to get all characters in the universe, which will be the nodes in a network. To create the network, we will use web scraping to determine edges between the nodes through hyperlinks on each character's page on the Harry Potter Wiki. To populate the nodes with attributes we will use a Harry Potter database API. Our last type of data will be the text content of the characters' wiki pages, on which we will perform text processing.
The Harry Potter universe is a fictional construct, but an extremely elaborate one. We therefore found it interesting to ask whether a fictional social network can be analyzed like a real one. The assumption is that the Harry Potter universe functions as a (real) network, and that the wiki pages are good documentation of the universe, because the fictional world is so well documented with information from books, movies, etc. The purpose of the task is to investigate whether expectations about the Harry Potter universe align with what a network based on wiki information reveals. We explore the universe by mapping the connections on the characters' wiki pages: each character is a node, and edges come from references to other characters in the text (directed edges, weighted by the number of references). Our focus is therefore really on the data that Harry Potter fans have compiled. The network will not be a direct map of the characters in the stories, but of the version of the universe that the fans have assembled in the wiki.
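The network construction described above can be sketched as follows. The character names and reference counts here are toy values for illustration, not the scraped data:

```python
from collections import Counter

# Toy reference counts: how many times each character's wiki page
# mentions other characters (hypothetical values for illustration).
references = {
    "Harry_Potter": Counter({"Ronald_Weasley": 3, "Hermione_Granger": 2}),
    "Ronald_Weasley": Counter({"Harry_Potter": 4}),
}

# Directed, weighted edge list: (source, target, number of references).
edges = [(src, dst, n) for src, counts in references.items()
         for dst, n in counts.items()]

print(sorted(edges))
```

An edge list in this shape can later be handed directly to `networkx.DiGraph.add_weighted_edges_from`.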
Throughout this notebook we have considered the reader's experience. To give the reader a better overview of the large notebook, we arranged a table of contents with hyperlinks that redirect to the chosen section. We also added many figures with specific axis labels and titles, so the reader can easily determine what each figure shows. When introducing a new theoretical term, such as TF-IDF scores or assortativity, we briefly explain the concept; this should clarify both why an analysis is carried out and what its outcome actually shows. We hope this gives the reader a good understanding of the entire analysis, making it more interesting to read through. Lastly, if the reader is familiar with the Harry Potter universe, the notebook reveals how the universe is actually composed.
We started out by querying the Harry Potter database API for all the characters of the universe, also collecting character attributes.
# API link
BASE_URL = "https://api.potterdb.com/"
VERSION = "v1/"
RESOURCE = "characters"
my_url = BASE_URL + VERSION + RESOURCE
print(my_url)
all_data = {}
counter = 0
for page in range(50):
    # JSON:API pagination: page[number] and page[size] query parameters
    url = my_url + f"?page[number]={page+1}&page[size]=100"
    r = requests.get(url)
    data = r.json()["data"]
    for character in data:
        attributes = character['attributes']
        all_data[counter] = {}
        all_data[counter]['Name'] = attributes['name']
        all_data[counter]['Blood status'] = attributes['blood_status']
        all_data[counter]['House'] = attributes['house']
        all_data[counter]['Species'] = attributes['species']
        all_data[counter]['Death time'] = attributes['died']
        all_data[counter]['Alias'] = attributes['alias_names']
        all_data[counter]['Wiki'] = attributes['wiki']
        all_data[counter]['Gender'] = attributes['gender']
        counter += 1
df_data = pd.DataFrame.from_dict(all_data, orient='index')
https://api.potterdb.com/v1/characters
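PotterDB follows the JSON:API pagination convention, where `page[number]` and `page[size]` are literal bracketed parameter names. A small sketch of building such a URL with the standard library (the page values below are arbitrary examples):

```python
from urllib.parse import urlencode

BASE = "https://api.potterdb.com/v1/characters"

def page_url(number, size=100):
    # page[number] / page[size] are the JSON:API pagination keys;
    # safe="[]" keeps the brackets readable instead of percent-encoding them.
    query = urlencode({"page[number]": number, "page[size]": size}, safe="[]")
    return f"{BASE}?{query}"

print(page_url(2))
```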
The next step was to web-scrape the characters' wiki pages. We collect the page content for text analysis, and all hyperlinks, to find references between characters.
# Function to webscrape Harry Potter wiki for text and links
def webscrapeWiki(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    content_div = soup.find('div', {'id': 'mw-content-text'})
    text = ""
    exclude_sections = ["Appearances"]
    exclude_links = ["Appearances"]
    # Collect paragraph text, stopping at excluded sections and skipping tables
    for p in content_div.find_all('p'):
        section_heading = p.find_previous_sibling('h2')
        if section_heading and section_heading.find('span', {'class': 'mw-headline'}).text in exclude_sections:
            break
        elif p.find_parents('table'):
            continue
        text += p.get_text() + " "
    links = []
    # Collect wiki links from list items outside excluded sections and tables
    for ul in content_div.find_all("ul"):
        section_heading = ul.find_previous_sibling('h2')
        if section_heading and section_heading.find('span', {'class': 'mw-headline'}).text in exclude_sections:
            break
        elif ul.find_parents('table'):
            continue
        for li in ul.find_all("li"):
            for a in li.find_all("a"):
                if a.find_parents('table'):
                    continue
                split = a['href'].split("/")
                if len(split) > 1:
                    if split[1] == "wiki" and section_heading is not None and section_heading.find('span', {'class': 'mw-headline'}).text not in exclude_links:
                        links.append(split[2])
    # Links in the infobox are always kept
    infobox = soup.find('aside', {'class': 'portable-infobox'})
    if infobox:
        for a in infobox.find_all('a'):
            href = a.get('href')
            if href:
                split = href.split("/")
                if len(split) > 1:
                    if split[1] == "wiki":
                        links.append(split[2])
    return text, links
df_data["Wiki text"], df_data["Character links"] = zip(*df_data["Wiki"].apply(webscrapeWiki))
The texts and character links are added to our data, which now looks like this:
df_data.head(2)
| | Name | Blood status | House | Species | Death time | Alias | Wiki | Gender | Wiki text | Character links | Wiki name | Sum character links |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1992 Gryffindor vs Slytherin Quidditch match s... | None | None | None | None | None | https://harrypotter.fandom.com/wiki/1992_Gryff... | None | The title of this article is conjectural. Alt... | [Canon, Filius_Flitwick, Irma_Pince, Severus_S... | 1992_Gryffindor_vs_Slytherin_Quidditch_match_s... | {'Filius_Flitwick': 2, 'Irma_Pince': 3, 'Sever... |
| 1 | 1996 Gryffindor Quidditch Keeper trials specta... | None | None | None | None | None | https://harrypotter.fandom.com/wiki/1996_Gryff... | None | In September 1996, a number of unidentified sp... | [Harry_Potter_and_the_Half-Blood_Prince, Septe... | 1996_Gryffindor_Quidditch_Keeper_trials_specta... | {'Harry_Potter': 1, 'Ronald_Weasley': 1, 'Corm... |
We create a new column containing a character's wiki name (the end of the wiki URL), which is a unique identifier. Furthermore, we make sure to only keep the web-scraped references that actually point to other characters in our data set.
def wikiName(wikiURL):
    return wikiURL.split("/")[-1]

def checkLinks(linkCounter, wikiNames):
    # Keep only links that point to characters present in the data set
    wikiNameSet = set(wikiNames)
    cleanLinks = linkCounter.copy()
    for link in linkCounter.keys():
        if link not in wikiNameSet:
            del cleanLinks[link]
    return cleanLinks

df_data["Wiki name"] = df_data["Wiki"].apply(wikiName)
df_data["Sum character links"] = df_data["Character links"].apply(Counter)
df_data["Sum character links"] = df_data["Sum character links"].apply(checkLinks, wikiNames=df_data["Wiki name"])
#Save dataframe
df_data.to_pickle("df_data")
The data frame is pickled so that the expensive scraping does not have to be re-run every time the notebook is executed.
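The caching pattern used here can be sketched in a general form; the file name and toy data below are arbitrary, not the notebook's actual data:

```python
import os
import tempfile
import pandas as pd

def load_or_build(path, build):
    # Load a cached DataFrame if the pickle exists; otherwise build and pickle it.
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = build()
    df.to_pickle(path)
    return df

path = os.path.join(tempfile.gettempdir(), "demo_cache.pkl")
if os.path.exists(path):
    os.remove(path)

df1 = load_or_build(path, lambda: pd.DataFrame({"Name": ["Harry", "Ron"]}))
df2 = load_or_build(path, lambda: pd.DataFrame())  # second call is served from cache
print(df2["Name"].tolist())
```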
# Load dataframe:
df_data_loaded = pd.read_pickle("df_data")
The original data set consisted of 4066 characters, but some of the characters had empty attributes. We concluded that characters with no information attached were not central to the story and should be removed. If a character name included 'Unidentified', they were also considered irrelevant. Furthermore, some of the entries are actually the actors from the Harry Potter movies, so we web-scraped all the actor names and removed them from the data set. Some nodes were groups of people, e.g. "Arthur Weasley's ten unidentified subordinates"; since we want to map connections between individual characters, we remove all entries with the species 'Humans' (the plural indicates a group). Lastly, we made an algorithm that checks whether a character has a hyperlink going to or from another character in the data set; if not, the character is removed. This filter corresponds to removing characters with degree zero.
To summarize, we apply five filters to the data:

1. Remove characters where all attributes (blood status, house, species, death time, and alias) are empty.
2. Remove characters whose wiki name starts with 'Unidentified'.
3. Remove entries that are actually actor names from the movies.
4. Remove groups of people, identified by the species 'Humans'.
5. Remove characters with no links to or from other characters (degree zero).

Before applying the fifth filter, we clean the web-scraped character links again, so they only contain references to characters still left in the data set.
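The filtering pattern (successive boolean masks, recording how many rows each step removes) can be sketched on toy data; the names below are illustrative, not the real data set:

```python
import pandas as pd

df = pd.DataFrame({
    "Wiki name": ["Harry_Potter", "Unidentified_wizard", "Hogwarts_students", "Dobby"],
    "Species":   ["Human",        "Human",               "Humans",            "House-elf"],
})

sizes = [len(df)]
df = df[~df["Wiki name"].str.startswith("Unidentified")]  # drop unidentified characters
sizes.append(len(df))
df = df[df["Species"] != "Humans"]                        # drop groups of people
sizes.append(len(df))

# Rows removed by each successive filter
removed = [a - b for a, b in zip(sizes, sizes[1:])]
print(removed)
```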
# Getting filtered data
# Removing names holding 'Unidentified'
def checkUnidentified(wikiName):
    return wikiName.startswith("Unidentified")

# Checking if links are persons
def checkLinks(linkCounter, wikiNames):
    wikiNameSet = set(wikiNames)
    cleanLinks = linkCounter.copy()
    for link in linkCounter.keys():
        if link not in wikiNameSet:
            del cleanLinks[link]
    return cleanLinks

def removeDegreeZeros(df):
    hasDegreeDict = {}
    # Check if each character has out-degree:
    for i in df.index:
        hasDegreeDict[df["Wiki name"][i]] = False
        if len(df["Sum character links"][i]):
            hasDegreeDict[df["Wiki name"][i]] = True
    # Look at all character in-degree references:
    for i in df.index:
        for n in df["Sum character links"][i]:
            hasDegreeDict[n] = True
    remove = [name for (name, degrStatus) in hasDegreeDict.items() if not degrStatus]
    df_degree = df.drop(df[df["Wiki name"].isin(remove)].index)
    return df_degree
# Function that scrapes actor names from each HP movie's IMDb cast page and appends them to a list
def scrape_actors(url, actor_names):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        cast_element = soup.find('table', class_='cast_list')
        if cast_element:
            for row in cast_element.find_all('tr'):
                actor_element = row.find('td', class_='primary_photo')
                if actor_element:
                    actor_name = actor_element.find('img').get('alt')
                    if actor_name not in actor_names:
                        actor_names.append(actor_name)
    return actor_names
actor_names = []
# IMDb full-credits pages for the eight Harry Potter movies
movie_urls = {
    "The Philosopher's Stone":      'https://www.imdb.com/title/tt0241527/fullcredits#cast',
    "The Chamber of Secrets":       'https://www.imdb.com/title/tt0295297/fullcredits#cast',
    "The Prisoner of Azkaban":      'https://www.imdb.com/title/tt0304141/fullcredits#cast',
    "The Goblet of Fire":           'https://www.imdb.com/title/tt0330373/fullcredits#cast',
    "The Order of the Phoenix":     'https://www.imdb.com/title/tt0373889/fullcredits#cast',
    "Half-Blood Prince":            'https://www.imdb.com/title/tt0417741/fullcredits#cast',
    "The Deathly Hallows – Part 1": 'https://www.imdb.com/title/tt0926084/fullcredits#cast',
    "The Deathly Hallows – Part 2": 'https://www.imdb.com/title/tt1201607/fullcredits#cast',
}
for url in movie_urls.values():
    actor_names = scrape_actors(url, actor_names)
len_data = len(df_data_loaded)
# Filter 1: keep characters with some information in blood status, house, species, death time or alias
filtered_data = df_data_loaded[~(df_data_loaded["Blood status"].isnull()
                                 & df_data_loaded["House"].isnull()
                                 & df_data_loaded["Species"].isnull()
                                 & df_data_loaded["Death time"].isnull()
                                 & df_data_loaded["Alias"].isnull())]
filt_data1 = len(filtered_data)
# Filter 2: remove unidentified people
filtered_data = filtered_data[~filtered_data["Wiki name"].apply(checkUnidentified)]
filt_data2 = len(filtered_data)
# Filter 3: remove actor names from the 'Name' column
mask = ~filtered_data['Name'].isin(actor_names)
filtered_data = filtered_data[mask]
filt_data3 = len(filtered_data)
# Filter 4: remove groups, identified by the species 'Humans'
filtered_data = filtered_data[~(filtered_data['Species'] == 'Humans')]
filt_data4 = len(filtered_data)
# Remove links to non-characters:
filtered_data["Sum character links"] = filtered_data["Sum character links"].apply(checkLinks, wikiNames=filtered_data["Wiki name"])
# Filter 5: remove characters with no links out or in
filtered_data = removeDegreeZeros(filtered_data)
filt_data5 = len(filtered_data)
print('First filter removes:', len_data-filt_data1, 'characters')
print('Second filter removes:', filt_data1-filt_data2, 'characters')
print('Third filter removes:', filt_data2-filt_data3, 'characters')
print('Fourth filter removes:', filt_data3-filt_data4, 'characters')
print('Fifth filter removes:', filt_data4-filt_data5, 'characters')
First filter removes: 91 characters
Second filter removes: 431 characters
Third filter removes: 4 characters
Fourth filter removes: 56 characters
Fifth filter removes: 1731 characters
Next we cleaned the house names. Some house names had unnecessary information attached, which we removed. Long house names were typically explanations rather than actual houses, so we changed those to 'Unknown'. Lastly, None was changed to 'Unknown' to have one collective category.
The column describing the species of the characters is also cleaned, gathering all characters who have no specific species, or more than one species assigned, into consistent categories.
def cleanHouses(House):
    # None becomes the collective 'Unknown' category
    if House is None:
        return 'Unknown'
    strings = House.split(' ')
    if len(strings) == 1:
        return strings[0]
    # Two words ending in ')' is a house name with extra info in parentheses
    if len(strings) == 2 and strings[-1][-1] == ')':
        return strings[0]
    # Longer strings are explanations rather than house names
    return 'Unknown'

def cleanSpecies(species):
    if not species:
        return 'Unknown'
    elif species == "Human (formerly), Ghost":
        return "Ghost, Human (formerly)"
    elif species.startswith("Dog"):
        return "Dog"
    else:
        return species

def cleanGenders(gender):
    # Cleans the gender category into Male, Female, Mixed, and Unknown
    if gender:
        if len(gender) > 4:
            # If male with extra description:
            if gender[:4] == "Male" and gender[4] != "s":
                return "Male"
            # If part of a group:
            elif gender[:5] == "Males":
                return "Mixed"
            # If female with extra description:
            elif len(gender) > 7 and gender[:6] == "Female":
                return "Female"
            else:
                return gender
        else:
            return gender
    else:
        # If None, change to unknown:
        return "Unknown"
# Applying cleaning functions
filtered_data['House'] = filtered_data['House'].apply(cleanHouses)
filtered_data['Species'] = filtered_data['Species'].apply(cleanSpecies)
filtered_data['Gender'] = filtered_data['Gender'].apply(cleanGenders)
The data now looks like this:
filtered_data.head(2)
| | Name | Blood status | House | Species | Death time | Alias | Wiki | Gender | Wiki text | Character links | Wiki name | Sum character links |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Aberforth Dumbledore | Half-blood | Unknown | Human | None | [Ab] | https://harrypotter.fandom.com/wiki/Aberforth_... | Male | \n\n\n\n\n\n\n\nAberforth Dumbledore\n\n\nBiog... | [Charm, Aberforth_Dumbledore%27s_goat_charm, P... | Aberforth_Dumbledore | {'Albus_Dumbledore': 2, 'Gellert_Grindelwald':... |
| 10 | Abernathy | None | Unknown | Human | None | None | https://harrypotter.fandom.com/wiki/Abernathy | Male | \n\n\n\n\n\n\n\nAbernathy\n\n\nBiographical in... | [Apparition, Lestrange_Mausoleum, Abernathy%27... | Abernathy | {'Percival_Graves': 1, 'Peter_Pettigrew': 1, '... |
First we investigate the species and visualize the different types, in order to understand the characters better.
# Top 5 species
top_species = pd.DataFrame(filtered_data['Species'].value_counts().head(5))
top_species
| Species | Count |
|---|---|
| Human | 1518 |
| Ghost, Human (formerly) | 16 |
| Goblin | 10 |
| House-elf | 9 |
| Giant | 9 |
Another way to investigate the distribution of species among characters is by making a word cloud of the species column in the data set: species that occur more often appear larger in the word cloud, and vice versa. Below, both the function for making the word cloud and the actual word cloud for the species column are shown.
# Define a function to generate wordcloud
def generate_freq_wordcloud(data):
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          min_font_size=10)
    wordcloud.generate_from_frequencies(frequencies=data)
    plt.figure(figsize=(3, 3), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
species_freq = dict(Counter(filtered_data['Species']))
generate_freq_wordcloud(species_freq)
Next we look at the distribution of characters into houses as this division plays an essential role in the lore of the universe.
# Getting data ready for plotting in a histogram
houses = filtered_data['House']
houses_types = []
for i in houses:
    if i not in houses_types:
        houses_types.append(i)
houseDict = {}
for i in range(len(houses_types)):
    houseDict[i] = houses_types[i]
keys = list(houseDict.keys())
values = list(houseDict.values())
houses_int = []
for i in houses:
    for j in range(len(values)):
        if i == values[j]:
            houses_int.append(keys[j])
dict_len = []
for i in range(len(houses_types)):
    dict_len.append(len(filtered_data[filtered_data['House'] == houses_types[i]]))
len_houses = pd.DataFrame()
len_houses['House'] = houseDict.values()
len_houses['Count'] = dict_len
import plotly.express as px
# Plotting the distribution of house sizes
fig = px.bar(len_houses, x='House', y='Count', title="Count of members in each house", width=800, height=400)
fig.update_traces(width=0.8, marker_color=[('#8A8C8A'),('#ffc500'), ('#1a472a'), ('#0a5ea8'), ('#7f0909'), ('#8A8C8A'), ('#8A8C8A'),('#8A8C8A')], opacity=0.75)
fig.update_xaxes(griddash = 'solid')
fig.update_layout(title_x=0.5)
fig.show()
# Save figure as png
#fig.write_image("houses_count.png")
From the plot it is clear that the number of characters assigned an unknown house is much greater than the size of any actual house.
The code below saves the plot above as an interactive plot and outputs the html used on the website to display the plot:
import chart_studio
import chart_studio.plotly as py
import chart_studio.tools as tls
username = 's204052' # your username
api_key = 'YOUR_API_KEY' # your api key (redacted) - go to profile > settings > regenerate key
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
url = py.plot(fig, filename = 'house_count', auto_open=False)
tls.get_embed(url)
We know the majority of the characters are humans, but to understand the data even better we look at the gender distribution, plotted below. We expect a somewhat even distribution of the genders, with a potentially large 'Unknown' class, as there are many minor characters and non-humans in the data set.
genderCount = Counter(filtered_data['Gender'])
genders = pd.DataFrame()
genders['Gender'] = genderCount.keys()
genders['Count'] = genderCount.values()
# Plotting the distribution of genders
fig = px.bar(genders, x='Gender', y='Count', title="Count of genders in data set", width=800, height=400)
fig.update_traces(width=0.8, marker_color=[('#0a5ea8'),('#7f0909'),('#8A8C8A'), ('#ffc500'), ('#1a472a'), ('#8A8C8A'), ('#8A8C8A'),('#8A8C8A')], opacity=0.75)
fig.update_xaxes(griddash = 'solid')
fig.update_layout(title_x=0.5, height=400)
fig.show()
There is a clear majority of males in the data set, with almost half as many females. This is unexpected, as there is no apparent reason in the books or movies to explain it. Furthermore, there is a 'Mixed' category. Upon further inspection below, we see that it describes groups. One of our filters above was intended to remove groups, but it missed non-human groups like 'Hogwarts house-elves' and 'Hogwarts school mice'. This was discovered too late in the project to account for. However, only 8 data entries have the 'Mixed' gender, and they still function as nodes with references to other characters, so they do not affect the rest of our analysis.
filtered_data[filtered_data['Gender']=='Mixed'].head(3)
| | Name | Blood status | House | Species | Death time | Alias | Wiki | Gender | Wiki text | Character links | Wiki name | Sum character links |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1555 | Hogwarts house-elves | None | Unknown | House-elves | None | None | https://harrypotter.fandom.com/wiki/Hogwarts_h... | Mixed | \n\n\n\n\n\n\n\nHogwarts house-elves\n\n\nPhys... | [Nonsuch, Rooky, House-elf, Hogwarts_kitchens,... | Hogwarts_house-elves | {} |
| 1557 | Hogwarts school mice | None | Unknown | Mice | None | None | https://harrypotter.fandom.com/wiki/Hogwarts_s... | Mixed | \n\nHogwarts school mice\n\nBiographical infor... | [1994, Mouse, Hogwarts_School_of_Witchcraft_an... | Hogwarts_school_mice | {'Minerva_McGonagall': 1} |
| 1582 | House of Black house-elves | None | Unknown | House-elves | Varied; between 1850 and 1985 | None | https://harrypotter.fandom.com/wiki/House_of_B... | Mixed | \n\n\n\n\n\n\n\nHouse of Black house-elves\n\n... | [Scrope%27s_relatives, Scrope, Phineas_Nigellu... | House_of_Black_house-elves | {'Phineas_Nigellus_Black': 1, 'Kreacher': 1, '... |
The code below saves the plot above as an interactive plot and outputs the html used on the website to display the plot:
url = py.plot(fig, filename = 'gender_count', auto_open=False)
tls.get_embed(url)
'<iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://plotly.com/~s204052/4.embed" height="525" width="100%"></iframe>'
Our data set mostly contains categorical attributes (not yet looking at the number of references to other characters or text lengths). Looking at the distribution of species, we found that most characters are human. This is not surprising, as wizards are humans too, and the stories are centered around wizards. As we would expect, we also found many different magical creatures in the species word cloud, even though they were generally not highly represented. Looking at the houses, we found that most characters are not assigned a house, which was surprising, as our initial hypothesis was that the houses would be central to the whole universe. Besides the 'Unknown' house, 'Slytherin' and 'Gryffindor' were the largest, likely because the feud between these two houses is central to the Harry Potter books and movies. From the distribution plot of the genders, we found a majority of males, with half as many females. Furthermore, we found that the data set still contains some groups. We tried to filter groups out based on the species 'Humans', but did not consider groups of other species. This mistake was caught late in the process but is inconsequential to the rest of our analysis.
First, we initialized the graph using the characters' wiki names as nodes and their house as the group attribute of each node.
# Making node lists
wiki_names = list(filtered_data['Wiki name'])
character_links = filtered_data['Sum character links']
houses = list(filtered_data['House'])
species = list(filtered_data['Species'])
genders = list(filtered_data['Gender'])
# Initializing graph
G = nx.DiGraph()
# Adding nodes with group according to house attribute
for i in range(len(wiki_names)):
    G.add_node(wiki_names[i], group=houses[i], species=species[i], house=houses[i], gender=genders[i])
print(G)
DiGraph with 1751 nodes and 0 edges
Next we used the references found on the wiki pages to make weighted edges between the nodes.
# Making weighted edgelist
edges = []
for j in filtered_data.index:
    name = filtered_data['Wiki name'][j]
    links = filtered_data['Sum character links'][j]
    for key, val in links.items():
        edges.append((name, key, val))
# Adding edgelist to graph
G.add_weighted_edges_from(edges)
print(G)
DiGraph with 1751 nodes and 7852 edges
Earlier, we filtered out characters with no edges, but as an extra check we attempt to remove degree-zero nodes again.
# Remove nodes with no edges
remove = [node for node,degree in G.degree() if degree == 0]
G.remove_nodes_from(remove)
print(f"{len(remove)} nodes with degree 0 were removed.") # Should be 0 as they are already removed
0 nodes with degree 0 were removed.
# Interactive visualization of graph (with group attribute according to 'house')
netwulf.interactive.visualize(G)
(None, None)
# Image of visualized graph
from IPython.display import Image
Image(url="static/images/total_network.png", width=500, height=500)
The nodes are colored according to the characters' 'House' attribute, i.e. which house they are a member of at Hogwarts. The cell below specifies the color of each house.
# House and node color pairs
from tabulate import tabulate
color_dict = [['Unknown','Orange'],
['Gryffindor', 'Dark orange'],
['Slytherin', 'Light blue'],
['Ravenclaw', 'Dark blue'],
['Hufflepuff', 'Beige'],
['Wampus', 'Pink'],
['Thunderbird', 'Purple'],
['Pukwudgie', 'Green']]
print(tabulate(color_dict, headers=['House','Color of nodes']))
House Color of nodes ----------- ---------------- Unknown Orange Gryffindor Dark orange Slytherin Light blue Ravenclaw Dark blue Hufflepuff Beige Wampus Pink Thunderbird Purple Pukwudgie Green
This reveals that a large share of the characters are not assigned to a house (the 'Unknown' house). It is also clear that many characters are only linked to one or two other people, as seen from the outer ring of small grouped nodes. Nevertheless, the characters in the center of the network are highly interconnected.
In figure 2 the network is sized by strength, meaning the nodes are scaled according to the sum of the weights of their incoming edges. It is then clear that the most connected character in the Harry Potter universe is Harry Potter himself, as he has the largest sum of incoming edge weights.
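Node strength (weighted in-degree) can be computed directly from a weighted edge list; the edges below are toy values for illustration. In networkx, the same quantity is available as `G.in_degree(weight='weight')`.

```python
from collections import Counter

# Toy weighted, directed edges: (source, target, weight)
edges = [
    ("Ronald_Weasley", "Harry_Potter", 4),
    ("Hermione_Granger", "Harry_Potter", 5),
    ("Harry_Potter", "Hermione_Granger", 2),
]

# Strength = sum of the weights of incoming edges, per node
strength = Counter()
for src, dst, w in edges:
    strength[dst] += w

print(strength.most_common(1))
```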
# Showing graph sized by strength
from IPython.display import Image
Image(url="static/images/network_sized_strength.png", width=500, height=500)
To compare our Harry Potter graph, we create a random graph with the same number of nodes and edges. It will still be directed, but the in- and out-degrees of the nodes will not be preserved. The edges will have the same distribution of weights, but the weights will be assigned to edges randomly.
# Initializing random graph with same number of nodes and edges as Harry Potter network
rand_G = nx.gnm_random_graph(len(G.nodes(data=True)), len(G.edges()), seed=42, directed=True)
print(rand_G)
DiGraph with 1751 nodes and 7852 edges
# Collect the edge weights from the Harry Potter network:
weights = []
for (name1, name2, data) in G.edges.data():
    weights.append(data['weight'])
# Randomize the list and assign the weights to the random graph's edges:
random.shuffle(weights)
for tupl in rand_G.edges.data():
    tupl[2]['weight'] = weights.pop()
# Visualizing random graph
netwulf.interactive.visualize(rand_G)
(None, None)
# Showing random network sized by strength
from IPython.display import Image
Image(url="static/images/random_network.png", width=500, height=500)
The random network looks much more clustered, and the degree of each node appears more evenly distributed, such that there are no lonely nodes, as there are in the Harry Potter network. At the same time, the sum of incoming edge weights (strength) is more similar across all nodes.
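The claim that strength is spread more evenly in the random graph can be quantified, for instance by comparing the standard deviation of node strengths. A sketch on two toy edge lists (one hub-like, one uniform; both hypothetical):

```python
from collections import Counter
from statistics import pstdev

def strengths(edges):
    # Sum of incoming edge weights per node
    s = Counter()
    for src, dst, w in edges:
        s[dst] += w
    return s

# Toy hub-like network (one node attracts most weight) vs a more uniform one
hub_edges     = [("a", "hub", 10), ("b", "hub", 8), ("c", "a", 1), ("hub", "b", 1)]
uniform_edges = [("a", "b", 5), ("b", "c", 5), ("c", "d", 5), ("d", "a", 5)]

spread_hub = pstdev(strengths(hub_edges).values())
spread_uni = pstdev(strengths(uniform_edges).values())
print(spread_hub > spread_uni)
```

A larger standard deviation of strengths indicates a more hub-dominated network, like the Harry Potter graph.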
To start understanding the Harry Potter network, we investigate the degrees. Since the graph is directed, we consider both in- and out-degrees. First we find the top characters based on each measure.
# Finding top 10 nodes (characters) with highest in-degree and out-degree
in_degrees = {}
for i in G.in_degree:
    new_name = i[0].replace('_', ' ')
    in_degrees[new_name] = i[1]
out_degrees = {}
for i in G.out_degree:
    new_name = i[0].replace('_', ' ')
    out_degrees[new_name] = i[1]
max_in_deg = sorted(in_degrees, key=in_degrees.get, reverse=True)[:10]
max_out_deg = sorted(out_degrees, key=out_degrees.get, reverse=True)[:10]
in_df = pd.DataFrame(sorted(in_degrees.values(), reverse=True)[:10], max_in_deg, columns=['In-degree'])
out_df = pd.DataFrame(sorted(out_degrees.values(), reverse=True)[:10], max_out_deg, columns=['Out-degree'])
in_df
| Character | In-degree |
|---|---|
| Harry Potter | 210 |
| Tom Riddle | 187 |
| Albus Dumbledore | 111 |
| Ronald Weasley | 94 |
| Hermione Granger | 90 |
| Sirius Black | 67 |
| Ginevra Weasley | 65 |
| Severus Snape | 65 |
| Arthur Weasley | 62 |
| Draco Malfoy | 62 |
out_df
| Character | Out-degree |
|---|---|
| Harry Potter | 92 |
| Albus Dumbledore | 73 |
| Jacob's sibling | 62 |
| Ronald Weasley | 60 |
| Hermione Granger | 59 |
| Arthur Weasley | 49 |
| Bellatrix Lestrange | 48 |
| Ginevra Weasley | 48 |
| Cedrella Black | 45 |
| Tom Riddle | 43 |
Next, we will plot the degree distributions. The distributions are heavy tailed, so we have used a logarithmic scale.
# Using matplotlib to plot the distribution of in-degrees and out-degrees
in_degrees = [degree for character, degree in G.in_degree()]
out_degrees = [degree for character, degree in G.out_degree()]
bins = np.logspace(0, np.log10(max(in_degrees)), 30)
hist, edges = np.histogram(in_degrees, bins=bins, density=True)
x = (edges[1:] + edges[:-1]) / 2
xx, yy = zip(*[(i, j) for (i, j) in zip(x, hist) if j > 0])
fig, ax = plt.subplots(dpi=100, figsize=(7, 5))
ax.plot(xx, yy, marker='.', label='In-degree', color='#7f0909')
hist, edges = np.histogram(out_degrees, bins=bins, density=True)
x = (edges[1:] + edges[:-1]) / 2
xx, yy = zip(*[(i, j) for (i, j) in zip(x, hist) if j > 0])
ax.plot(xx, yy, marker='.', label='Out-degree', color='#0a5ea8')
ax.set_xlabel('Degree')
ax.set_ylabel('Probability density')
ax.set_yscale("log")
ax.set_xscale("log")
ax.set_title("Distribution of degrees with logarithmic binning")
ax.vlines(187, 0, max(yy), ls="--", colors='#7f0909', label='In-degree (Tom Riddle)')
ax.vlines(43, 0, max(yy), ls="--", colors='#0a5ea8', label='Out-degree (Tom Riddle)')
ax.legend(loc='upper right')
ax.grid(color='grey', linestyle='-', linewidth=0.2)
plt.savefig('degree_distribution_network.png', dpi=300)
plt.show()
The plot above does not show the relation between in- and out-degree per character. To investigate whether characters with a high in-degree also have a high out-degree, we plot the relation between the two measures below:
# Plotting in-degree vs out-degree (looking for a relationship)
in_degrees = [degree for character, degree in G.in_degree()]
out_degrees = [degree for character, degree in G.out_degree()]
fig, ax = plt.subplots(figsize=(10,5) ,dpi=100)
ax.scatter(in_degrees, out_degrees, c=('#0a5ea8'), s=4)
ax.set_yscale("log")
ax.set_xscale("log")
ax.plot([0, 10, 50, 100, 150, 175, 200], [0, 10, 50, 100, 150, 175, 200], "--", color=('#7f0909'))
ax.set_xlabel('In-degree')
ax.set_ylabel('Out-degree')
ax.set_title('Out- versus in-degree for all characters in Harry Potter network')
ax.grid(color='grey', linestyle='-', linewidth=0.2)
plt.savefig('in_out_scatter.png', dpi=300)
So far, we have not considered the weights of the edges. They are stored as edge attributes and can be extracted from each edge. Below we visualize the distribution of the weights. This distribution is also heavy-tailed, so we again plot it on a logarithmic scale.
# Looking at the distribution of edge weights
all_weights = []
for u, v, data in G.edges(data=True):
    all_weights.append(list(data.values())[0])
bins = np.logspace(0, np.log10(max(all_weights)),30)
hist, edges = np.histogram(list(all_weights), bins = bins, density = True)
x = (edges[1:] + edges[:-1])/2
xx, yy = zip(*[(i, j) for (i,j) in zip(x, hist) if j>0])
fig, ax = plt.subplots(dpi=100,figsize=(7,5))
ax.plot(xx, yy, marker='.', label='Edge weights', color=('#7f0909'))
ax.set_xlabel('Edge weight')
ax.set_ylabel('Probability density of weight among characters')
ax.set_yscale("log")
ax.set_xscale("log")
ax.set_title("Distribution of edge weights")
ax.vlines(14, 0, max(yy), ls="--", colors='black', label='Albus Dumbledore > Harry Potter')
ax.vlines(7, 0, max(yy), ls="--", colors='grey', label='Harry Potter > Albus Dumbledore')
ax.legend(loc='upper right')
ax.grid(color='grey', linestyle='-', linewidth=0.2)
plt.savefig('edge_weight_distribution_network.png', dpi=300)
plt.show()
On the logarithmic scale we see an almost linear relation between density and edge weight, meaning that the majority of the edges have a very low weight while a few edges have very high weights. In the plot, we have marked the two edges between Harry Potter and Albus Dumbledore. We see that Dumbledore's wiki page references Harry more times than Harry's wiki page references Dumbledore.
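The near-linear trend on the log-log axes suggests an approximate power law. As an illustration (not part of the original analysis), the exponent can be estimated by fitting a straight line to the log-binned points with np.polyfit; the Pareto-sampled weights below are a hypothetical stand-in for the real edge weights:

```python
import numpy as np

# Hypothetical heavy-tailed weights standing in for the real edge weights
rng = np.random.default_rng(42)
weights = np.round(rng.pareto(1.5, size=5000) + 1)

# Logarithmic binning, as in the plot above
bins = np.logspace(0, np.log10(weights.max()), 30)
hist, edges = np.histogram(weights, bins=bins, density=True)
centers = (edges[1:] + edges[:-1]) / 2
mask = hist > 0

# A straight line in log-log space: log(p) ~ slope * log(w) + c
slope, intercept = np.polyfit(np.log10(centers[mask]), np.log10(hist[mask]), 1)
print(f"Estimated power-law exponent: {-slope:.2f}")
```

Fitting the binned densities is only a rough estimate; a maximum-likelihood fit (e.g. the powerlaw package) would be more rigorous.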
We investigate how much the network mixes across the attributes species, gender, and house. Strong mixing means that nodes with the same attribute value tend to be connected. For each node, we compare its attributes to those of the nodes it points to with a directed edge.
# Mixing patterns based on house, species, or gender:
def computeMixingPattern(G, attributes):
    fractions = {attr: [] for attr in attributes}
    node_attrs = {attr: nx.get_node_attributes(G, attr) for attr in attributes}
    for node in G.nodes():
        neighbors = G.neighbors(node)
        count = {attr: 0 for attr in attributes}
        for n in neighbors:
            for attr in attributes:
                # Count successors that share this node's attribute value
                if node_attrs[attr][n] == node_attrs[attr][node]:
                    count[attr] += 1
        # Only nodes with outgoing edges contribute a fraction
        if G.out_degree[node] > 0:
            for attr in attributes:
                fractions[attr].append(count[attr] / G.out_degree[node])
    return fractions
attributes = ["house" , "species", "gender"]
fractions = computeMixingPattern(G,attributes)
for attr in attributes:
av_frac = np.mean(fractions[attr])
print(f'The average mixing value of {attr} attribute: {av_frac}')
harry_means = { attr: np.mean(fractions[attr]) for attr in attributes}
The average mixing value of house attribute: 0.5725365285792122 The average mixing value of species attribute: 0.8480627902506895 The average mixing value of gender attribute: 0.5290805930462767
We now have the mixing patterns of the Harry Potter network, but no baseline for judging whether these values are high or low. Therefore, we run a randomization experiment: we copy the graph structure but randomly shuffle the attributes, compute the mixing patterns for these random graphs, and visualize the resulting distributions. We computed 500 random attribute networks. If the Harry Potter values lie outside these null distributions, the observed mixing is unlikely to be random.
# Making a copy of the network and shuffling its node attributes
def shuffleNetworkAttr(attributes):
    copy_G = G.copy()
    node_attrs = {attr: list(nx.get_node_attributes(copy_G, attr).values()) for attr in attributes}
    for attr in attributes:
        random.shuffle(node_attrs[attr])
    nodes = list(copy_G.nodes())
    node_dict = {}
    for i in range(len(copy_G)):
        node_dict[nodes[i]] = {}
        for attr in attributes:
            node_dict[nodes[i]][attr] = node_attrs[attr][i]
    nx.set_node_attributes(copy_G, node_dict)
    return copy_G
attributes = ["house", "species", "gender"]
X = {attr: [] for attr in attributes}
# 500 repetitions of the shuffling experiment
for i in range(500):
    copy_G = shuffleNetworkAttr(attributes)
    fractions = computeMixingPattern(copy_G, attributes)
    for attr in attributes:
        X[attr].append(np.mean(fractions[attr]))
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=3)
colors = [('#ffc500'), ('#1a472a'), ('#0a5ea8')]
for i, key in enumerate(X.keys()):
    trace = go.Histogram(x=X[key], nbinsx=20, name=key, marker_color=colors[i])
    fig.add_trace(trace, row=1, col=i+1)
    fig.add_vline(x=harry_means[key], line_width=3, line_dash="dash", row=1, col=i+1,
                  name="Mean from Harry Potter network", line_color=colors[i])
fig.update_layout(title="Histograms of mixing patterns in random attribute networks of House, Species, and Gender",
                  xaxis_title="Value",
                  yaxis_title="Frequency",
                  showlegend=True,
                  legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99))
fig.show()
The dashed lines show the mixing values from the Harry Potter network. All three values lie outside the corresponding random distributions, so the mixing patterns in the Harry Potter network are significant: nodes of the same gender and species tend to link to each other, whereas nodes of the same house do not.
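To make "lies outside the distribution" precise, one could report an empirical p-value: the fraction of shuffled networks whose mean mixing value is at least as extreme as the observed one. A minimal sketch with made-up numbers (in the notebook, the null values and observed value would come from X and harry_means):

```python
import numpy as np

def empirical_p_value(null_values, observed):
    """Two-sided empirical p-value with the +1 correction,
    so the p-value is never exactly zero."""
    null_values = np.asarray(null_values)
    extreme = np.sum(np.abs(null_values - null_values.mean())
                     >= np.abs(observed - null_values.mean()))
    return (extreme + 1) / (len(null_values) + 1)

# Hypothetical null distribution and observed value, for illustration only
rng = np.random.default_rng(0)
null = rng.normal(0.30, 0.01, size=500)   # stand-in for X['species']
observed = 0.85                           # stand-in for harry_means['species']
p = empirical_p_value(null, observed)
print(f"Empirical p-value: {p:.4f}")      # smallest attainable value is 1/501
```

With 500 shuffles the smallest reportable p-value is 1/501, which is why more repetitions give finer resolution.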
The code below outputs the HTML embed code for showing the interactive plot on a website.
# Requires the chart_studio package for uploading to plotly.com
import chart_studio.plotly as py
import chart_studio.tools as tls
url = py.plot(fig, filename = 'attr_mixing_patterns', auto_open=False)
tls.get_embed(url)
'<iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://plotly.com/~s204052/8.embed" height="525" width="100%"></iframe>'
import scipy.stats as stats
# Mean, median, mode, min, and max of the in- and out-degrees of a network
def findStats(G):
    in_deg = [d for n, d in G.in_degree()]
    out_deg = [d for n, d in G.out_degree()]
    print('Minimum of in-degree: ', np.min(in_deg), ', and out-degree:', np.min(out_deg))
    print('Maximum of in-degree: ', np.max(in_deg), ', and out-degree:', np.max(out_deg))
    print('Mean of in-degree: ', np.mean(in_deg), ', and out-degree:', np.mean(out_deg))
    print('Median of in-degree:', np.median(in_deg), ', and out-degree:', np.median(out_deg))
    print('Mode of in-degree:', stats.mode(in_deg, keepdims=True).mode, ', and out-degree:', stats.mode(out_deg, keepdims=True).mode)
# Harry Potter network
findStats(G)
Minimum of in-degree: 0 , and out-degree: 0 Maximum of in-degree: 210 , and out-degree: 92 Mean of in-degree: 4.484294688749286 , and out-degree: 4.484294688749286 Median of in-degree: 1.0 , and out-degree: 2.0 Mode of in-degree: [0] , and out-degree: [1]
# Average clustering coefficient of Harry Potter network
av_clust = nx.average_clustering(G)
print('Average clustering coefficient of Harry Potter network:', av_clust)
Average clustering coefficient of Harry Potter network: 0.30371066676110486
from tabulate import tabulate
# Finding the top 5 characters with the highest non-trivial clustering coefficient
count0 = 0
count1 = 0
all_coef = list(nx.clustering(G).items())
for i in list(nx.clustering(G).items()):
    if i[-1] == 0.0:
        count0 += 1
        all_coef.remove(i)
    if i[-1] == 1.0:
        count1 += 1
        all_coef.remove(i)
print('Number of characters with clustering coefficient equal to 0:', count0)
print('Number of characters with clustering coefficient equal to 1:', count1)
all_coef.sort(key=lambda x: x[1], reverse=True)
print(tabulate(all_coef[0:5], headers=['Character', 'Clustering coefficient']))
Number of characters with clustering coefficient equal to 0: 817 Number of characters with clustering coefficient equal to 1: 200 Character Clustering coefficient ---------------- ------------------------ Lucy_Weasley 0.9848 Sirius_Black_I 0.981481 Stamford_Jorkins 0.98 Audrey_Weasley 0.9792 Mary_Riddle 0.96875
in_in = nx.degree_assortativity_coefficient(G, x='in', y='in')
in_out = nx.degree_assortativity_coefficient(G, x='in', y='out')
out_in = nx.degree_assortativity_coefficient(G, x='out', y='in')
out_out = nx.degree_assortativity_coefficient(G, x='out', y='out')
# Putting all the assortativity coefficients in a table
rows = [['In-degree', in_in, in_out], ['Out-degree', out_in, out_out]]
print(tabulate(rows, headers=['', 'In-degree', 'Out-degree']))
In-degree Out-degree ---------- ----------- ------------ In-degree 0.043646 0.12583 Out-degree 0.070133 0.199737
# Finding largest connected component
ll_hp = max(nx.strongly_connected_components(G), key=len)
print('Largest strongly connected subgraph in Harry Potter network is:', len(ll_hp))
# Average shortest path in largest connected component
avg_short_path = nx.average_shortest_path_length(G.subgraph(ll_hp))
print('Average shortest path in the largest connected component in Harry Potter network is:', avg_short_path)
Largest strongly connected subgraph in Harry Potter network is: 562 Average shortest path in the largest connected component in Harry Potter network is: 3.659466763088283
# Distribution of the shortest path
lengths = []
for i in ll_hp:
    lengths.append(nx.shortest_path_length(G.subgraph(ll_hp), source=i))
lengths = [x for y in lengths for x in y.values()]
The distribution of shortest path lengths is plotted in the distribution for shortest paths section.
# Average closeness centrality of Harry Potter network
av_close_hp = list(nx.closeness_centrality(G).values())
print('Average closeness centrality of Harry Potter network: ', np.mean(av_close_hp))
Average closeness centrality of Harry Potter network: 0.06403777053280169
The distribution of closeness centrality among nodes in the Harry Potter network is plotted in the section distribution for closeness centrality.
# Statistics of random network
findStats(rand_G)
Minimum of in-degree: 0 , and out-degree: 0 Maximum of in-degree: 13 , and out-degree: 13 Mean of in-degree: 4.484294688749286 , and out-degree: 4.484294688749286 Median of in-degree: 4.0 , and out-degree: 4.0 Mode of in-degree: [4] , and out-degree: [4]
# Computing clustering coefficient using networkx function
clust_rand = nx.average_clustering(rand_G)
print('Clustering coefficient of random network: ', clust_rand)
Clustering coefficient of random network: 0.002668296101460343
in_in = nx.degree_assortativity_coefficient(rand_G, x='in', y='in')
in_out = nx.degree_assortativity_coefficient(rand_G, x='in', y='out')
out_in = nx.degree_assortativity_coefficient(rand_G, x='out', y='in')
out_out = nx.degree_assortativity_coefficient(rand_G, x='out', y='out')
# Putting all the assortativity coefficients in a table
rows = [['In-degree', in_in, in_out], ['Out-degree', out_in, out_out]]
print(tabulate(rows, headers=['', 'In-degree', 'Out-degree']))
In-degree Out-degree ---------- ----------- ------------ In-degree 0.0198941 0.00270551 Out-degree 0.0212653 0.00019944
# Finding largest connected component
ll_rand = max(nx.strongly_connected_components(rand_G), key=len)
print('Largest strongly connected subgraph in random network is:', len(ll_rand))
# Average shortest path in largest connected component
avg_short_path = nx.average_shortest_path_length(rand_G.subgraph(ll_rand))
print('Average shortest path in the largest connected component in random network is:', avg_short_path)
Largest strongly connected subgraph in random network is: 1717 Average shortest path in the largest connected component in random network is: 5.149643697401414
# Distribution of the shortest path for random network
lengths_rand = []
for i in ll_rand:
    lengths_rand.append(nx.shortest_path_length(rand_G.subgraph(ll_rand), source=i))
lengths_rand = [x for y in lengths_rand for x in y.values()]
# Average closeness centrality of random network
av_close_rand = list(nx.closeness_centrality(rand_G).values())
print('Average closeness centrality of random network: ', np.mean(av_close_rand))
Average closeness centrality of random network: 0.19165483740264888
bins = np.linspace(min(lengths), max(lengths),30)
hist, edges = np.histogram(list(lengths), bins = bins, density = False)
x = (edges[1:] + edges[:-1])/2
xx, yy = zip(*[(i, j) for (i,j) in zip(x, hist) if j>0])
fig, ax = plt.subplots(dpi=100,figsize=(7,5))
ax.plot(xx, yy, marker='.', label='Harry Potter network', color=('#7f0909'))
hist, edges = np.histogram(list(lengths_rand), bins = bins, density = False)
x = (edges[1:] + edges[:-1])/2
xx, yy = zip(*[(i, j) for (i,j) in zip(x, hist) if j>0])
ax.plot(xx, yy, marker='.', label='Random network', color=('#0a5ea8'))
ax.set_xlabel('Length of path')
ax.set_ylabel('Count of paths')
ax.set_title("Distribution of length of shortest path in networks")
ax.legend(loc='upper right')
ax.grid(color='grey', linestyle='-', linewidth=0.2)
#plt.savefig('shortest_path_distribution.png', dpi=300)
plt.show()
# Plotting distribution of closeness centrality
bins = np.linspace(min(av_close_rand), max(av_close_rand),30)
hist, edges = np.histogram(list(av_close_rand), bins = bins, density = False)
x = (edges[1:] + edges[:-1])/2
xx, yy = zip(*[(i, j) for (i,j) in zip(x, hist) if j>0])
fig, ax = plt.subplots(dpi=100,figsize=(7,5))
ax.plot(xx, yy, marker='.', label='Random network', color=('#7f0909'))
hist, edges = np.histogram(list(av_close_hp), bins = bins, density = False)
x = (edges[1:] + edges[:-1])/2
xx, yy = zip(*[(i, j) for (i,j) in zip(x, hist) if j>0])
ax.plot(xx, yy, marker='.', label='Harry Potter network', color=('#0a5ea8'))
ax.set_xlabel('Closeness centrality')
ax.set_ylabel('Count of characters')
ax.set_title("Distribution of closeness centrality in networks")
ax.legend(loc='upper right')
ax.grid(color='grey', linestyle='-', linewidth=0.2)
plt.savefig('closeness_cent_distribution_network.png', dpi=300)
plt.show()
In order to investigate and compare different community splits, we first form communities by splitting all characters into categories based on their house attribute. Afterwards, the Louvain communities of the Harry Potter network are found, and the two splits can be evaluated and compared by computing their modularity.
# Finding communities from splitting graph by 'House' attribute
import networkx.algorithms.community as nx_comm
houses = list(filtered_data['House'])
houses_types = []
for i in houses:
    if i not in houses_types:
        houses_types.append(i)
print('The different houses are:', houses_types)
houses_com = []
for i in houses_types:
    houses_com.append({x for x, y in G.nodes(data=True) if y['group'] == i})
# Computing modularity of the 'house' split
house_mod = nx_comm.modularity(G, houses_com)
print('The modularity for the "house" split is: ' + str(house_mod))
no_comm = len(houses_com)
print('The number of "house" communities is ' + str(no_comm))
size_comm = []
for i in range(no_comm):
    size_comm.append(len(houses_com[i]))
The different houses are: ['Unknown', 'Hufflepuff', 'Slytherin', 'Ravenclaw', 'Gryffindor', 'Thunderbird', 'Pukwudgie', 'Wampus'] The modularity for the "house" split is: 0.1336468956913941 The number of "house" communities is 8
# Finding Louvain communities
louv_comm = nx_comm.louvain_communities(G)
modularity_louv = nx_comm.modularity(G, louv_comm)
print('The modularity of Louvain algorithm is ' + str(modularity_louv))
no_comm = len(louv_comm)
print('The number of Louvain communities is ' + str(no_comm))
size_comm_lou = []
for i in range(no_comm):
    size_comm_lou.append(len(louv_comm[i]))
The modularity of Louvain algorithm is 0.5699260711894045 The number of Louvain communities is 376
bins = np.logspace(0, np.log10(max(size_comm)),50)
hist, edges = np.histogram(list(size_comm), bins = bins, density = True)
x = (edges[1:] + edges[:-1])/2
xx, yy = zip(*[(i, j) for (i,j) in zip(x, hist) if j>0])
fig, ax = plt.subplots(dpi=100,figsize=(7,5))
ax.plot(xx, yy, marker='.', label='House', color=('#7f0909'))
ax.vlines(np.mean(size_comm), 0, max(yy), ls="--", colors=('#7f0909'), label='Mean of House communities sizes')
hist, edges = np.histogram(list(size_comm_lou), bins = bins, density = True)
x = (edges[1:] + edges[:-1])/2
xx, yy = zip(*[(i, j) for (i,j) in zip(x, hist) if j>0])
ax.plot(xx, yy, marker='.', label='Louvain', color=('#0a5ea8'))
ax.set_xlabel('Size of community')
ax.set_ylabel('Probability density of count')
ax.set_title("Distribution of communities sizes with logarithmic bins")
ax.vlines(np.mean(size_comm_lou), 0, max(yy), ls="--", colors=('#0a5ea8'), label='Mean of Louvain communities sizes')
ax.legend(loc='upper right')
ax.grid(color='grey', linestyle='-', linewidth=0.2)
ax.set_yscale("log",base=10)
ax.set_xscale("log",base=10)
plt.savefig('houses_communities.png', dpi=300)
plt.show()
To explain why the mean of the house community sizes is so heavily skewed to the right, the house community sizes are tabulated below:
# House community sizes
from tabulate import tabulate
columns = ['House', 'Size']
rows = []
for i in range(len(houses_types)):
    rows.append([houses_types[i], size_comm[i]])
print(tabulate(rows, columns))
House Size ----------- ------ Unknown 1399 Hufflepuff 61 Slytherin 118 Ravenclaw 61 Gryffindor 107 Thunderbird 3 Pukwudgie 1 Wampus 1
# Fraction of the unknown community of all
frac_unknown = size_comm[0]/sum(size_comm)
print('Fraction of unknown house community of all: ' + str(frac_unknown))
Fraction of unknown house community of all: 0.7989720159908623
Here it is seen that the members are very unequally distributed across the house communities: the 'Unknown' house contains approximately 80% of all characters.
import statistics
shortest = min(filtered_data['Wiki text'], key=len)
longest = max(filtered_data['Wiki text'], key=len)
avg_length = sum(len(text) for text in filtered_data['Wiki text']) / len(filtered_data['Wiki text'])
median_length = statistics.median(len(text) for text in filtered_data['Wiki text'])
print('The shortest wiki text is', len(shortest),'characters (with spaces)')
print('The longest wiki text is', len(longest),'characters (with spaces)')
print('The average length of a wiki text is', (avg_length),'characters (with spaces)')
print('The median is', (median_length),'characters (with spaces)')
The shortest wiki text is 126 characters (with spaces) The longest wiki text is 285806 characters (with spaces) The average length of a wiki text is 4486.303479749002 characters (with spaces) The median is 1004 characters (with spaces)
wiki_lengths = [len(text) for text in filtered_data['Wiki text']]
fig, axes = plt.subplots(figsize=(10,5), dpi=100)
axes.hist(wiki_lengths, bins=50, color='blue', alpha=0.5, label='Wiki text')
axes.axvline(sum(wiki_lengths)/len(wiki_lengths), color='navy', linestyle='--', label='Mean')
axes.axvline(sorted(wiki_lengths)[len(wiki_lengths)//2], color='orange', linestyle='--', label='Median')
axes.set_title('Distribution of Wiki text lengths')
axes.set_xlabel('Length')
axes.set_ylabel('Count')
axes.legend()
# plt.savefig('Distribution_wiki_text.png', dpi=300)
plt.show()
wiki_lengths = [len(text) for text in filtered_data['Wiki text']]
bins = np.logspace(0, np.log10(max(wiki_lengths)),50)
hist, edges = np.histogram(list(wiki_lengths), bins = bins, density = True)
x = (edges[1:] + edges[:-1])/2
xx, yy = zip(*[(i, j) for (i,j) in zip(x, hist) if j>0])
fig, ax = plt.subplots(dpi=100,figsize=(7,5))
ax.plot(xx, yy, marker='.', label='Length of Wiki text', color=('#7f0909'))
ax.set_xlabel('Length of Wiki text')
ax.set_ylabel('Density of count')
ax.set_title("Distribution of Wiki text lengths")
ax.vlines(np.mean(wiki_lengths), 0, max(yy), ls="--", colors=('#0a5ea8'), label='Mean of lengths')
ax.vlines(np.median(wiki_lengths), 0, max(yy), ls="--", colors=('#0a5ea8'), label='Median of lengths')
ax.legend(loc='upper right')
ax.grid(color='grey', linestyle='-', linewidth=0.2)
ax.set_yscale("log",base=10)
ax.set_xscale("log",base=10)
#plt.savefig('Distribution_wiki_text.png', dpi=300)
plt.show()
# Function to tokenize text
def tokenize_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words and placeholder values
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words and token not in ('none', 'nan')]
    return tokens
# Add column 'tokens' to df
filtered_data['tokens'] = filtered_data['Wiki text'].apply(tokenize_text)
token_lengths = [len(tokens) for tokens in filtered_data['tokens']]
# Plot the distribution of the tokenized text lengths
fig, axes = plt.subplots(figsize=(10,5), dpi=100)
axes.hist(token_lengths, bins=50, color='blue', alpha=0.5, label='Tokenized text')
axes.axvline(sum(token_lengths)/len(token_lengths), color='navy', linestyle='--', label='Mean')
axes.axvline(sorted(token_lengths)[len(token_lengths)//2], color='orange', linestyle='--', label='Median')
axes.set_title('Distribution of tokenized text lengths')
axes.set_xlabel('Length')
axes.set_ylabel('Count')
axes.legend()
# plt.savefig('Distribution_wiki_text_tokenized.png', dpi=300)
plt.show()
token_lengths = [len(tokens) for tokens in filtered_data['tokens']]
bins = np.logspace(0, np.log10(max(token_lengths)),50)
hist, edges = np.histogram(list(token_lengths), bins = bins, density = True)
x = (edges[1:] + edges[:-1])/2
xx, yy = zip(*[(i, j) for (i,j) in zip(x, hist) if j>0])
fig, ax = plt.subplots(dpi=100,figsize=(7,5))
ax.plot(xx, yy, marker='.', label='Length of tokenized Wiki text', color=('#7f0909'))
ax.set_xlabel('Length of tokenized Wiki text')
ax.set_ylabel('Density of count')
ax.set_title("Distribution of tokenized Wiki text lengths")
ax.vlines(np.mean(token_lengths), 0, max(yy), ls="--", colors=('#0a5ea8'), label='Mean of tokenized lengths')
ax.vlines(np.median(token_lengths), 0, max(yy), ls="--", colors=('#0a5ea8'), label='Median of tokenized lengths')
ax.legend(loc='upper right')
ax.grid(color='grey', linestyle='-', linewidth=0.2)
ax.set_yscale("log",base=10)
ax.set_xscale("log",base=10)
#plt.savefig('Distribution_wiki_text_tokenized.png', dpi=300)
plt.show()
def compute_difference(row):
    return len(row["Wiki text"]) - len(row["tokens"])
filtered_data["length_difference"] = filtered_data.apply(compute_difference, axis=1)
average_wiki_length = filtered_data["Wiki text"].str.len().mean()
print(f"Average length of 'Wiki text': {average_wiki_length:.2f}")
average_tokens_length = filtered_data["tokens"].str.len().mean()
print(f"Average length of 'tokens': {average_tokens_length:.2f}")
average_difference = filtered_data["length_difference"].mean()
print(f"Average difference: {average_difference:.2f}")
Average length of 'Wiki text': 4486.30 Average length of 'tokens': 405.92 Average difference: 4080.38
We see that tokenizing the text drastically reduced the number of words in each character's Wikipedia text. This makes sense, since Wikipedia articles are often written in a formal style and aim to provide comprehensive, neutral information on a particular topic. As a result, they contain many common and generic words known as stopwords. By tokenizing the text we also remove punctuation, numbers, and special characters, which contributes further to the reduction.
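As a sanity check of this reduction, the same cleaning steps can be run on a single sentence. The sketch below mirrors tokenize_text but uses a naive whitespace split and a tiny hand-picked stopword list instead of NLTK, so it is only an approximation of the notebook's pipeline:

```python
import re
import string

# Tiny stand-in stopword list (the notebook uses NLTK's full English list)
STOP = {'the', 'a', 'an', 'in', 'of', 'and', 'is', 'was', 'to', 'he'}

def mini_tokenize(text):
    text = re.sub(r'http\S+', '', text)                               # strip URLs
    text = text.translate(str.maketrans('', '', string.punctuation))  # strip punctuation
    text = re.sub(r'\d+', '', text)                                   # strip numbers
    tokens = text.lower().split()                                     # naive tokenization
    return [t for t in tokens if t not in STOP]

sample = "Harry Potter was born in 1980 and is a wizard."
tokens = mini_tokenize(sample)
print(tokens)  # content words survive; stopwords, digits, and punctuation do not
```

Even on this short sentence roughly half the words disappear, which matches the large average reduction observed above.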
house_documents = {}
for house in filtered_data['House'].unique():
    if house is None or house == "None":
        continue
    house_df = filtered_data[filtered_data['House'] == house]
    house_documents[house] = ' '.join(house_df['tokens'].explode().tolist())
# Print the names of the house documents
print("Names of house documents:")
for house in house_documents:
    print(house)
# Count the number of house documents
num_houses = len(house_documents)
print("Number of house documents:", num_houses)
Names of house documents: Unknown Hufflepuff Slytherin Ravenclaw Gryffindor Thunderbird Pukwudgie Wampus Number of house documents: 8
The code below generates word clouds for each of the four Hogwarts houses based on the TF-IDF scores of words in their respective documents. The TF-IDF scores represent the importance of each word in a document, taking into account both the frequency of the word in the document and its frequency in the corpus as a whole.
The code uses the TfidfVectorizer function from the sklearn.feature_extraction.text module to calculate the TF-IDF scores for each document. It then selects the top 100 words with the highest TF-IDF scores for each house, and generates a word cloud for each house based on these top 100 words.
We print the top 10 TF-IDF words for each house in order to get a better idea of the top words.
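To make the scores concrete, here is a toy TF-IDF computation on two tiny hypothetical "house documents", written out by hand rather than with TfidfVectorizer (sklearn additionally smooths and normalizes, so its numbers differ slightly):

```python
import math
from collections import Counter

# Two made-up mini-documents standing in for the real house documents
docs = {
    'HouseA': 'harry wand wand spell'.split(),
    'HouseB': 'harry potion cauldron'.split(),
}

# Document frequency: in how many documents does each term occur?
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))
N = len(docs)

def tfidf(tokens):
    tf = Counter(tokens)
    # Plain tf * idf with idf = log(N / df); sklearn uses a smoothed variant
    return {t: (c / len(tokens)) * math.log(N / df[t]) for t, c in tf.items()}

scores = tfidf(docs['HouseA'])
print(scores)
# 'harry' occurs in both documents, so idf = log(2/2) = 0 and its score vanishes,
# while the house-specific 'wand' and 'spell' keep positive scores.
```

This is why shared words like 'harry' and 'hogwarts' only dominate the real word clouds because each house document is vectorized separately; fitting one vectorizer across all four documents at once would down-weight them.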
house_colors = {
"Gryffindor": "#7F0909",
"Hufflepuff": "#FFC500",
"Ravenclaw": "#0A5EA8",
"Slytherin": "#1A472A"
}
def generate_wordcloud_tf_idf(tfidf_scores, ax):
    for house, tfidf in tfidf_scores.items():
        data = dict(tfidf)
        color = house_colors.get(house, 'white')
        wordcloud = WordCloud(width=800, height=800, background_color=color,
                              min_font_size=10, colormap='viridis')
        wordcloud.generate_from_frequencies(frequencies=data)
        ax.imshow(wordcloud)
        ax.axis("off")
        ax.set_title(house, fontsize=16)
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', use_idf=True)
selected_houses = ['Gryffindor', 'Slytherin', 'Hufflepuff', 'Ravenclaw']
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
for i, house in enumerate(selected_houses):
    document = house_documents.get(house)
    if document is None:
        continue
    tfidf = vectorizer.fit_transform([document])
    words = vectorizer.get_feature_names_out()  # get_feature_names() is deprecated
    scores = tfidf.toarray()[0]
    # Top 100 words for the word cloud, top 10 for printing
    top100_scores = sorted(zip(words, scores), key=lambda x: x[1], reverse=True)[:100]
    top10_scores = top100_scores[:10]
    print("Top 10 words and scores for", house, ":")
    for word, score in top10_scores:
        print(word, score)
    print("--------------------------")
    generate_wordcloud_tf_idf({house: top100_scores}, axs[i//2, i%2])
plt.tight_layout(pad=0)
# plt.savefig('wordclouds.png', dpi=300)
plt.show()
Top 10 words and scores for Gryffindor : harry 0.5861462680076663 ron 0.2347650556979932 hogwarts 0.22301403005913825 hermione 0.22199220174271608 dumbledore 0.1720503427775826 weasley 0.1438223355364202 year 0.14011820788938983 school 0.13564770900504286 potter 0.13168812427890694 voldemort 0.11661615661167997 --------------------------
Top 10 words and scores for Slytherin : harry 0.3269527068515053 voldemort 0.2878410851562119 black 0.23671478228654724 hogwarts 0.23543662471480561 snape 0.2165198926530297 death 0.18967858364645576 draco 0.16743864189815166 family 0.1633485376685785 school 0.15414580315203885 slytherin 0.14980006740811735 --------------------------
Top 10 words and scores for Hufflepuff : newt 0.31767164296338185 harry 0.27576602197672295 hogwarts 0.27509012486403495 school 0.21966656162361511 year 0.20141733958103786 hufflepuff 0.16491889549588334 jacobs 0.15883582148169093 penny 0.15613223303093873 tonks 0.1541045416928746 sibling 0.15207685035481047 --------------------------
Top 10 words and scores for Ravenclaw : harry 0.41612785521950924 hogwarts 0.28492216362015854 school 0.26241138319870133 year 0.21031500565190034 luna 0.19680853739902598 ravenclaw 0.17108193120307488 potter 0.14599849016202254 lockhart 0.1414963340777311 professor 0.13763734314833845 students 0.12863303097975556 --------------------------
## Special wordclouds
top_10_characters = ['Harry James Potter', 'Tom Marvolo Riddle', 'Albus Percival Wulfric Brian Dumbledore', 'Ronald Bilius Weasley', 'Hermione Jean Granger', 'Sirius Black III', 'Ginevra Molly Potter (née Weasley)', 'Severus Snape', 'Arthur Weasley', 'Draco Lucius Malfoy']
df = filtered_data[filtered_data['Name'].isin(top_10_characters)]
# Load the image paths
image_paths = {
"Harry James Potter": "Charecter_png/harry_potter.png",
"Tom Marvolo Riddle": "Charecter_png/tom_riddle.png",
"Albus Percival Wulfric Brian Dumbledore": "Charecter_png/Albus.png",
"Ronald Bilius Weasley": "Charecter_png/ron_weasley.png",
"Sirius Black III": "Charecter_png/siruis_black.png",
"Ginevra Molly Potter (née Weasley)": "Charecter_png/ginny.png",
"Severus Snape": "Charecter_png/Snape.png",
"Arthur Weasley": "Charecter_png/Arthur.png",
"Hermione Jean Granger": "Charecter_png/hermione_granger.png",
"Draco Lucius Malfoy": "Charecter_png/Draco_Malfoy.png",
}
for character in image_paths:
    # Load image mask
    img = Image.open(image_paths[character])
    documents = df[df['Name'] == character]['tokens']
    text = ' '.join(documents.apply(lambda x: ' '.join(x)))
    mask = np.array(img)
    stopwords = set(STOPWORDS)
    wc = WordCloud(background_color="white", stopwords=stopwords, mask=mask, width=3000, height=3000,
                   max_words=100, max_font_size=300, min_font_size=5)
    wc.generate(text)
    image_colors = ImageColorGenerator(mask)
    # Figure
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
    ax.imshow(mask, cmap=plt.cm.gray, alpha=0.2, interpolation='bilinear')
    ax.axis('off')
    ax.set_title(character, fontsize=40, color='#7F0909')
    plt.tight_layout(pad=0)
    # plt.savefig(f"static/images/{character}.png", format='png', dpi=300)
    plt.show()
We have performed various types of analysis on the characters in the Harry Potter universe, including mapping the characters as a social network and analyzing their descriptive wiki texts. We started out with a hypothesis that the division into the Hogwarts houses would be a central theme in the data. However, it quickly became apparent from the distribution of characters into houses that the majority of the data points had 'Unknown' as the house attribute. When we compared the community division based on houses to the algorithmic Louvain division, it was also clear from the modularity scores that the house division did not fit the data well. This meant our initial hypothesis did not hold. The likely reason is that the data contains many more characters than just the ones in the Harry Potter books: the three 'Fantastic Beasts' movies add many characters from America and the rest of the world, who end up in the 'Unknown' house. Knowing this, it is quite understandable that the school houses are too simple a division. Furthermore, in our analysis of mixing patterns we found that nodes of the same house do not tend to reference each other more than a random network would. Though this is surprising, it further explains why the house community division was far worse than the Louvain division.
We did still analyze the texts of the characters in the Hogwarts houses to see if some of our initial thoughts held up. We expected to see different words in focus for the four houses. However, many of the words were the same, like 'Harry' and 'Hogwarts'. There were some differences among the words with lower TF-IDF scores, but overall the house wordclouds did not show significant differences in house themes. The biggest standout was the word 'Newt' in the Hufflepuff wordcloud; as Newt Scamander is the main character of the 'Fantastic Beasts' series, his importance makes a lot of sense.
Creating individual wordclouds showed interesting results. There were clearer differences in the words, and they matched our expectations for the characters. The words with the highest TF-IDF scores were names, which is understandable, as names are quite unique and rarely used in many contexts. Beyond the names, the wordclouds mostly showed nouns and other factual words, which makes sense for a wiki page. However, since the wiki pages are written by fans, we could not simply have assumed factual, formal language, and an analysis like ours is useful for judging the trustworthiness of the pages. They seem to thoroughly describe the events of the lore without presenting subjective perspectives. An add-on for the project could be a sentiment analysis, which we would expect to come out neutral.
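The suggested sentiment add-on could use the TextBlob polarity score already imported in this notebook. As a dependency-free illustration of the underlying idea, here is a toy lexicon-based sketch with hypothetical word lists and snippets; factual wiki-style sentences containing no sentiment-bearing words would score 0, supporting the expectation of neutrality.

```python
# Toy sentiment lexicons (hypothetical; a real analysis would use
# TextBlob's polarity or NLTK's SentimentIntensityAnalyzer)
POSITIVE = {"brave", "loyal", "brilliant", "great"}
NEGATIVE = {"dark", "cruel", "evil", "feared"}

def polarity(text):
    """Score in roughly [-1, 1]: positive minus negative word counts, normalized."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / max(len(words), 1)

print(polarity("Harry was brave and loyal"))       # positive
print(polarity("Voldemort was cruel and feared"))  # negative
print(polarity("He attended Hogwarts in 1991"))    # neutral -> 0.0
```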
Diving back into our network analysis, we clearly showed that the Harry Potter network differentiates itself from a random network. Just looking at the in- and out-degrees, it is clear that some characters are very central while others carry little importance; this distribution was very different in the random network. The clustering coefficient supported this discovery: with a clustering coefficient of 0.3 for the Harry Potter network and 0.003 for the random network, the Harry Potter universe has far more clustering than would be possible if the characters were randomly connected. This large difference was likely driven by the 200 nodes in small, fully connected clusters, so an add-on to the investigation could have been to only look at the largest connected component. From our assortativity measures, we did not see exactly what we expected. Nodes did not seem particularly connected to other nodes with the same in- or out-degree, but the assortativity was most clearly different from random for the out-degree of the source nodes. The measures looking at the in-degree of the source node were still about twice as large as the corresponding measures for the random network, but from the numbers alone it is hard to tell whether this is significant. This is one of the main limitations of our network analysis: though we have compared all results to a single random network, a better analysis would look at distributions over many random networks. By computing statistics on 100+ random networks, we could have found 95% confidence intervals for the different metrics and concluded whether the Harry Potter network metrics were statistically different from random. This could be an interesting add-on for a future project. Furthermore, we could have run a randomization experiment to investigate whether the modularity of both the house division and the Louvain division differs from that of a random graph. This could be done with double edge swaps and would also be an interesting add-on.
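The proposed randomization experiment could look roughly like this: rewire the graph many times with degree-preserving double edge swaps, compute the metric of interest on each rewired copy, and compare the observed value to the resulting null distribution. This sketch uses a small toy graph and only 20 samples to keep runtime down; a real test would use 100+ samples and the actual character network.

```python
import networkx as nx
import numpy as np

def null_distribution(G, metric, n_samples=20, seed=0):
    """Degree-preserving null distribution of `metric` via double edge swaps."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        R = G.copy()
        nx.double_edge_swap(R, nswap=4 * R.number_of_edges(),
                            max_tries=10**5, seed=int(rng.integers(10**9)))
        samples.append(metric(R))
    return np.array(samples)

G = nx.karate_club_graph()  # toy stand-in for the (undirected) character network
observed = nx.average_clustering(G)
null = null_distribution(G, nx.average_clustering)
low, high = np.percentile(null, [2.5, 97.5])
# observed falling outside [low, high] would indicate significance
print(observed, (low, high))
```

The same scheme applies to modularity: compute the modularity of the house or Louvain partition on each rewired copy and check whether the observed value falls outside the 95% interval.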
We computed the average shortest path of the Harry Potter network and the random network and again found different values, with the Harry Potter network having the shorter paths. This analysis has some major issues. Firstly, the largest connected component of the random network is three times as big as that of the Harry Potter network, which means the maximum possible shortest path can be much larger in the random network. Secondly, the metric treats the edge weights as distances, so a high weight counts as a longer path, whereas in the way we modelled our network a high weight means that the characters are strongly connected. A better way to compute shortest paths would have been to invert the weights so that strong ties become short distances (simply negating the weights would break Dijkstra-style shortest-path algorithms). This issue carries over into the closeness centrality metric, where we found that the random network contains more nodes with a high closeness centrality score than the Harry Potter network. This does seem plausible, since the random network's degrees are spread more evenly, making it likely better connected across the whole graph. However, due to the issues described above, we cannot trust these metrics.
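The weight-inversion fix can be sketched on a tiny hypothetical weighted digraph. Using the raw weights as distances, the path through weak ties wins; after deriving an inverted-weight distance, the strongly connected route is correctly preferred.

```python
import networkx as nx

# Toy weighted digraph: higher weight = stronger connection (as in our model);
# the names and weights are hypothetical
G = nx.DiGraph()
G.add_weighted_edges_from([("Harry", "Ron", 10),
                           ("Ron", "Hermione", 10),
                           ("Harry", "Hermione", 2)])

# Derive a distance where strong ties are short: distance = 1 / weight
for u, v, d in G.edges(data=True):
    d["distance"] = 1 / d["weight"]

# Raw weights treat strong ties as long paths and pick the weak direct edge;
# inverted weights route through the strong Harry-Ron-Hermione connection
print(nx.shortest_path(G, "Harry", "Hermione", weight="weight"))
print(nx.shortest_path(G, "Harry", "Hermione", weight="distance"))
```

The same `distance` attribute could then be passed to `nx.closeness_centrality` (via its `distance` parameter) to repair the closeness analysis as well.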
The initial assumption we made going into the project was that the Harry Potter universe functions as a real network, given the well-documented world described on the Harry Potter Fandom wiki. When considering the network generated from the Harry Potter data and comparing it to a random network of the same size, it is clear that the Harry Potter universe mirrors more realistic connections. The clustering coefficient supports this: the much larger average clustering coefficient reveals that characters in the Harry Potter network tend to cluster in groups. In contrast, the closeness centrality shows that nodes appear more central in the random network than in the Harry Potter network. However, closeness centrality is not a reliable measure in our analysis, as it is based on shortest paths that treat high weights as long distances, when a high weight in our network is in fact a strong connection. The degree analysis reinforces that the random network is more uniformly connected among all nodes, since the mode of both the in- and out-degree is 4, compared to 0 and 1 for the Harry Potter network. The interconnectedness of the random network is also seen in the size of the largest strongly connected component, which is three times as large as the corresponding component of the Harry Potter network. Considering that clustering is very present in real-world social networks, we can conclude that all the observations above support the view that the Harry Potter universe looks like a true social network, in which characters are mostly connected in smaller groups of people. Nevertheless, the text analysis reveals the most central characters in the universe, starting with the length of the body text on each Wiki page: the main character Harry Potter has the longest, while more anonymous characters have the shortest.
But when investigating both the house-split communities and the TF-IDF scores for the Wiki texts of characters in the respective houses, we see that houses do not play that big a part in the overall universe. The analysis of individual wordclouds for the top 10 characters reveals that the Harry Potter Fandom wiki gives factual and objective descriptions, which contributes to a well-founded analysis. To improve upon the analysis of this project, more randomization tests could have been included to properly distinguish the network analysis metrics from random network distributions. Furthermore, the shortest path metric could have used inverted weights to represent the actual meaning of the weights in the network.